Many of the figures and analysis methodology were based off of this tutorial from NYU. https://learn.gencore.bio.nyu.edu/rna-seq-analysis/gene-set-enrichment-analysis/
Information received from the csv samplesheet:
Metadata information received from the tsv table:
## replicate strain
## 1 rep1 hrde1
## 2 rep2 hrde1
## 3 rep3 hrde1
## 4 rep1 hrde2
## 5 rep2 hrde2
## 6 rep3 hrde2
## 7 rep1 wt
## 8 rep2 wt
## 9 rep3 wt
Here, count data from the RNA-seq experiment is read in the form of a counts matrix. Each column holds data from one sample, and each row represents a gene, such that the i-th row and n-th column tells how reads of gene i were measured in sample n. The values received should be un-normalized counts of sequencing reads or fragments.
# # Read Count Matrix
count.matrix <- read_tsv(trimws(sample_params$salmon_merged_gene_counts_file_path)) %>%
mutate(across(3:ncol(.), ~ as.integer(.x))) %>%
column_to_rownames("gene_id")
gene.reference <- dplyr::select(count.matrix, 1)
count.matrix <- count.matrix %>%
dplyr::select(-1)
Information entered as metadata describes the samples (columns) of the count matrix. This metadata is combined with the sample/column names from the count matrix so the accuracy of the metadata can be reviewed. Take time to review the table below.
# Design ColData Matrix
sample <- colnames(count.matrix)
colData <- data.frame(sample)
# Add Columns for Condition to colData
for (i in colnames(metadata)){
colData[i] <- metadata[i]
}
colData <- column_to_rownames(colData, "sample")
## replicate strain
## hrde1_rep1 "rep1" "hrde1"
## hrde1_rep2 "rep2" "hrde1"
## hrde1_rep3 "rep3" "hrde1"
## hrde2_rep1 "rep1" "hrde2"
## hrde2_rep2 "rep2" "hrde2"
## hrde2_rep3 "rep3" "hrde2"
## wt_rep1 "rep1" "wt"
## wt_rep2 "rep2" "wt"
## wt_rep3 "rep3" "wt"
Using the count matrix, column metadata, and user-inputed design formula (expressing the variables to be used in modeling), a DESeq data set is created.
The experimental variable and the associated control/experimental conditions are given by the user in “ref_level_cond” and “ref_level_value” in the sampleInfo file respectively.
Pre-filtering is performed, keeping only genes that have a count of at least 10 in a minimum number of samples. The minimum number of samples is decided by calculating the number of times the reference level value (control condition) is listed in the metadata with the assumption that experimental conditions will be repeated the same number of times.
Standard differential expression analysis is performed with the DESeq function, and regularized log transformation (rlog) transforms count data to a log2 scale for PCA.
# Create DESeq Data Set
dds <- DESeqDataSetFromMatrix(
countData = count.matrix,
colData = colData,
design = as.formula(sample_params$design)
)
# Set Reference Level
reflevelcond <- trimws(sample_params$ref_level_cond)
reflevelvalue <- trimws(sample_params$ref_level_value)
dds[[reflevelcond]] <- relevel(dds[[reflevelcond]], ref = reflevelvalue)
# Subset out genes with low overall counts
smallest.group.size <- length(grep(sprintf("^%s$",reflevelvalue), metadata[[reflevelcond]]))
keep <- rowSums(counts(dds) >= 10) >= smallest.group.size
dds <- dds[keep,]
# Run DESeq
dds <- DESeq(dds)
# Regularized Log Transform
rld <- rlog(dds)
If input data is designated to be batch corrected, the ComBat_seq() function from the SVA Bioconductor package is used. The batch (condition) to be regressed out is given in sampleInfo.csv as “batch_cond”, and the sample information for this condition is taken from the metadata.
The group argument of ComBat_seq() specifies biological covariates, whos signals should be preserved in adjusted data. All conditions (columns) from the metadata which are not to be regressed out are passed to the group argument. Differential expression analysis is otherwise performed as described above on batch-corrected counts.
if (as.logical(trimws(sample_params$batch_correct))) {
# Check that condition for batch correction matches column metadata
if(all(trimws(sample_params$batch_cond) != colnames(metadata))){
stop(sprintf("The given condition for batch correction:%s, doesn't match any prior conditions: %s", sample_params$batch_cond, colnames(metadata)))
}
# Replicate / Batch Correction
batch <- trimws(sample_params$batch_cond)
batch.correct <- ComBat_seq(
counts = as.matrix(count.matrix),
batch = metadata[[batch]],
group = metadata[[colnames(metadata)[colnames(metadata) != batch]]]) %>%
as.data.frame()
# Create DESeq Data Set
batch.correct.dds <- DESeqDataSetFromMatrix(
countData = batch.correct,
colData = colData,
design = as.formula(sample_params$design))
# Set Factor Level
reflevelcond <- trimws(sample_params$ref_level_cond)
reflevelvalue <- trimws(sample_params$ref_level_value)
batch.correct.dds[[reflevelcond]] <- relevel(batch.correct.dds[[reflevelcond]], ref = reflevelvalue)
# Subset out genes with low overall counts
smallest.group.size <- length(grep(sprintf("^%s$",reflevelvalue), metadata[[reflevelcond]]))
keep <- rowSums(counts(batch.correct.dds) >= 10) >= smallest.group.size
batch.correct.dds <- batch.correct.dds[keep,]
# Run DESeq
batch.correct.dds <- DESeq(batch.correct.dds)
# Regularized Log Transform
batch.correct.rld <- rlog(batch.correct.dds)
}
Reference level condition and value (ref_level_cond and ref_level_value respectively) inputs from sampleInfo.csv are used to create contrast terms to compare treatment samples with control samples. This relies on the design formula being formatted correctly to produce comparisons relevant to the experimental condition.
# # Find relevant contrasts
contrasts <- list()
# Factor values of experimental condition
pattern <- factor(metadata[[reflevelcond]])
comp <- levels(pattern)
# Remove control state from experimental condition vector
comp <- comp[comp != reflevelvalue]
# Create Contrasts
for (i in 1:length(comp)){
contrasts[[i]] <- c(reflevelcond, comp[i], reflevelvalue)
}
Based on the number of experimental conditions (identified in the ref_level_cond column of the metadata) and the contrast terms comparing them to the experimental control (ref_level_value), differential expression results are extracted from the DESeq data set object.
# Find results from relevant contrasts
results <- list()
# Pull results from DESeq object with contrasts
for (i in 1:length(contrasts)){
results[[i]] <- list()
results[[i]][["DataFrame"]] <- results(dds.for.analysis, contrast = contrasts[[i]]) %>%
data.frame() %>%
rownames_to_column("gene_id") %>%
mutate(gene_name = gene.reference[gene_id, 1], .before=baseMean) %>% # Add gene names to DESeq results matrices
column_to_rownames("gene_id")
results[[i]][["DESeqDataSet"]] <- results(dds.for.analysis, contrast = contrasts[[i]])
}
The Bioconductor package PCAtools is used to perform principle component analysis on regularized log transformed count data. Scree plots show principle component numbers on the x-axis, and their respective eigenvalues on the y-axis.
The third graph plots experimental metadata against a number of PCs and their gene loadings to visualize the agreement with the gene expression pattern of each condition for each PC.
Inputs for function call:
compileReport(result.obj = results[[i]], contrast = sprintf("%s_%s_vs_%s",
contrasts[[i]][1], contrasts[[i]][2], contrasts[[i]][3]),
sample_name = trimws(sample_params$sample_name), l2fc_filter = 0.8,
padj_filter = 0.1)
Transfering WORMBASE gene IDs to ENTREZID:
308 gene IDs were not transfered from WORMBASE to ENTREZID
12108 gene IDs were successfully transfered from WORMBASE to ENTREZID
Number of highly variable genes identified (with | L2FC | > 0.8 and padj < 0.1) is:
495
Number of highly upregulated genes identified (with L2FC > 0.8 and padj < 0.1) is:
377
Number of highly downregulated genes identified (with L2FC < -0.8 and padj < 0.1) is:
118
protein phosphorylation
phosphorylation
dephosphorylation
ribonucleoprotein complex biogenesis
tRNA metabolic process
collagen and cuticulin-based cuticle development
male gamete generation
mitochondrial translation
mitochondrial gene expression
RNA modification
rRNA processing
mitochondrial respiratory chain complex assembly
defense response to symbiont
immune response
immune system process
cellular respiration
establishment of protein localization to membrane
developmental process involved in reproduction
neuropeptide signaling pathway
cell fate commitment
Mitochondrial translation termination
Translation
Mitochondrial translation
Mitochondrial translation elongation
SRP-dependent cotranslational protein targeting to membrane
GTP hydrolysis and joining of the 60S ribosomal subunit
Formation of a pool of free 40S subunits
Nonsense Mediated Decay (NMD) independent of the Exon Junction Complex (EJC)
Physiological factors
Nonsense-Mediated Decay (NMD)
Nonsense Mediated Decay (NMD) enhanced by the Exon Junction Complex (EJC)
Metabolism of RNA
Table Reactome Pathway Over-representation Analysis of Downregulated Genes has no rows.
Inputs for function call:
compileReport(result.obj = results[[i]], contrast = sprintf("%s_%s_vs_%s",
contrasts[[i]][1], contrasts[[i]][2], contrasts[[i]][3]),
sample_name = trimws(sample_params$sample_name), l2fc_filter = 0.8,
padj_filter = 0.1)
Transfering WORMBASE gene IDs to ENTREZID:
308 gene IDs were not transfered from WORMBASE to ENTREZID
12108 gene IDs were successfully transfered from WORMBASE to ENTREZID
Number of highly variable genes identified (with | L2FC | > 0.8 and padj < 0.1) is:
1557
Number of highly upregulated genes identified (with L2FC > 0.8 and padj < 0.1) is:
1051
Number of highly downregulated genes identified (with L2FC < -0.8 and padj < 0.1) is:
506
dephosphorylation
protein phosphorylation
phosphorylation
defense response to bacterium
male gamete generation
neuropeptide signaling pathway
sexual reproduction
mitochondrial transmembrane transport
Table Reactome Pathway GSE Of All Differentially Expressed Genes has no rows.